Resequencing and characterization of the first Corynebacterium pseudotuberculosis genome isolated from camel

Background Corynebacterium pseudotuberculosis is a zoonotic Gram-positive bacterial pathogen known to cause different diseases in many mammals, including lymph node abscesses in camels. Strains from biovars equi and ovis of C. pseudotuberculosis can infect camels. Comparative genomics could help to identify features related to host adaptation, and currently strain Cp162 from biovar equi is the only one from camel with a sequenced genome. Methods In this work, we compared the quality of three genome assemblies of strain Cp162 that used data from the DNA sequencing platforms SOLiD v3 Plus, IonTorrent PGM, and Illumina HiSeq 2500 with an optical map and investigate the unique features of this strain. For this purpose, we applied comparative genomic analysis on the different Cp162 genome assembly versions and included other 129 genomes from the same species. Results Since the first version of the genome, there was an increase of 88 Kbp and 121 protein-coding sequences, a decrease of pseudogenes from 139 to 53, and two inversions and one rearrangement corrected. We identified 30 virulence genes, none associated to the camel host, and the genes rpob2 and rbpA predicted to confer resistance to rifampin. In comparison to 129 genomes of the same species, strain Cp162 has four genes exclusively present, two of them code transposases and two truncated proteins, and the three exclusively absent genes lysG, NUDIX domain protein, and Hypothetical protein. All 130 genomes had the rifampin resistance genes rpob2 and rbpA. Our results found no unique gene that could be associated with tropism to camel host, and further studies should include more genomes and genome-wide association studies testing for genes and SNPs.


INTRODUCTION
Corynebacterium pseudotuberculosis is a zoonotic Gram-positive bacterium that causes caseous lymphadenitis (CLA) in various animals, including small ruminants, cattle, camelids, and other host disease manifestations.In this species, biovar equi is nitrate positive and biovar ovis is nitrate negative (Dorella et al., 2006).In Australian sheep, CLA causes estimated losses of $A12-$A15 million and $A17 million per year for the meat and wool industry, respectively (Baird & Fontaine, 2007).In Australia, the wild dromedary population in the interior has frequently exhibited unsightly lymph node abscesses.Similarly, in East Africa, a high prevalence of swollen external lymph nodes has been observed in almost all dromedaries, and it is believed that this may be linked to CLA resulting from the consumption of thorny plants.CLA mortality rates in dromedaries in countries other than Europe, where it can reach 15%, are unknown.However, death always occurs when the pathogen spreads into internal organs, mainly the lung and liver (Wernery & Kinne, 2016).
In this context, genomic data can be used for identification and taxonomy (Parks et al., 2022), evolutionary studies (Sheppard, Guttman & Fitzgerald, 2018), epidemiology (Gardy & Loman, 2018), and the development of control mechanisms such as drugs (Serral et al., 2021) and vaccines (Goodswen, Kennedy & Ellis, 2023).An ideal genome assembly should be complete, closed, and artifacts-free to avoid bias in analysis that relies on gene content, variant calling (Di Marco et al., 2023) and synteny (Mascher & Stein, 2014;Yuan, Chung & Chan, 2020).The highly accurate but short reads from recent second-generation DNA sequencing platforms result in assembly gaps with long repetitive sequences (Loman et al., 2012).Two strategies are used to solve this issue using short-read data: scaffolding using an optical map (Lehri, Seddon & Karlyshev, 2017) and a hybrid assembly, in which longer reads with lower accuracy from a third-generation DNA sequencing platform are used to generate an assembly that is error corrected using reads from a second-generation platform (Craddock et al., 2022;Di Marco et al., 2023).With the increasing quality of long-read sequencing and assembly algorithms, long-reads alone can be used for genome sequencing (Nurk et al., 2022;Sereika et al., 2022).
In C. pseudotuberculosis, it is known that horses and buffalo are only reported as hosts of the nitrate-positive biovar equi, but little is known about the mechanisms related to host tropism, besides the suggestion of diphtheria toxin as a requirement to infect buffalo (Viana et al., 2017).Strain Cp162 from the camel is currently the only strain from the camel with a sequenced genome.It was initially isolated from an external neck abscess of a camel in 1999 and was first sequenced in 2012 using the platform SOLiD v3 Plus (RefSeq accession NC_018019.1)(Hassan et al., 2012).The genome was then resequenced in 2017 using the Ion Torrent PGM platform (NC_018019.2) to improve genome assembly quality, and in 2019 using Illumina HiSeq 2500 with assembly using an optical map to improve the accuracy of the genome assembly (NC_018019.3)(Sousa et al., 2019).
In this study, we aimed to evaluate the improvements in the genome assemblies of strains Cp162, search for genes that could be related to tropism for camels and update the pangenome analysis of the species.

Analysis of Cp162 genome assemblies
We compared the three versions of Cp162 assemblies for completeness and contamination, size, gene content, and synteny.The number of genes was collected from the GBFF file of each genome.Differences in gene content were identified using Panaroo v1.3.3 (Tonkin-Hill et al., 2020) for gene clustering, with the parameters ''-remove-invalid-genes'' and ''clean-mode strict''.An in-house script was used for identifying exclusive genes (Data S2).Synteny was verified using Mauve v20150226 (Darling et al., 2004) with progressiveMauve algorithm.

Quality assessment and taxonomy
All genomes were classified as C. pseudotuberculosis.Completeness and contamination ranged between 97.37% and 100% and 0.14% and 6.43%, respectively.No evidence of chimerism was detected by GUNC (Data S1).

Analysis of Cp162 genome assemblies
The comparisons between the three assembly versions showed increased genome size and the number of coding sequences (CDS) (Table 1).Across all three versions, we identified 2,128 genes, 2,010 of them shared.About virulence genes, the first assembly version lacks the virulence genes nanH while spaC is fragmented as CP162_RS09080 and CP162_RS09085).The synteny analysis showed two inversions and one rearrangement in the first version, which were later corrected in the subsequent versions using the optical map (Fig. 1).Since the first assembly, the genome had an increase of 88 Kbp and 121 CDSs and a decrease of pseudogenes from 139 to 53.The third version has 18 exclusively present genes (  In the third assembly version, we predicted one incomplete prophage with 8.9 Kb and 18 CDSs (Data S5, S6 and S7), 13 GEIs, five of them pathogenicity islands (Fig. 2, Data S7), three CRISPR arrays, and a Type I-E CRISPR-Cas system (Data S7).PanViTa identified 30 virulence genes.The first and second assembly versions had strB and sapA as exclusive virulence genes, respectively (Table 1, Data S8 revealed that the CDSs classified as srtB (v1 locus_tag: CP162_RS09100, old_locus_tag: Cp162_1849) and sapA (v2 locus_tag: CP162_RS09670, old_locus_tag: Cp162_1968) were present in the three assembly versions, but with variations in size.PanViTa identified two antimicrobial resistance proteins: Rifampin-resistant beta-subunit of RNA polymerase (rpoB2, WP_041481489.1)and RbpA bacterial RNA polymerase-binding protein (rpbA, WP_014800420.1)(Table 1, Data S8).

Phylogeny and pangenome
A tree was generated by IQ-TREE2 using core genome alignment from Panaroo and MAFFT.The species tree in Fig. 3 shows two main clades.The first one contains a subclade composed of strains Cp162 (camel), G1 (alpaca), and I37 (cow, Israel) and another subclade containing strains isolated from horses and buffalo, separated by a host.The second one contains strain 262 (cow, Belgium) and all biovar ovis strains (collapsed) as its sister group.

Exclusive genes of Cp162
From the pangenome analysis, we also identified four proteins exclusively present in Cp162 and three exclusively absent in this lineage (Table 2).In the exclusively present group, two were predicted as the same transposase (WP_048653436.1).The other two are truncated (41 and 45 aa), none showed conserved domains, and one is in GEI 6. Analysis against the GenBank database using BLASTp and WP_275060758.1 as a query showed 92% of coverage with 52% of identity to an ATP-dependent helicase of Streptomyces spp.The same analysis with WP_231131458.1 showed 97.8% of identity with other truncated proteins from CpE19_1664 (AKS14002.1).From the group of proteins exclusively absent in Cp162, one was recognized by eggNOGmapper as a Transcriptional Regulator named lysG (COG category: K), another as an enzyme from NUDIX superfamily (COG category: L), and the last one as a hypothetical protein with no domains.

DISCUSSION
The improvements in long-read DNA sequencing platforms allowed the assembly of complete or near complete bacterial genomes (Sereika et al., 2022) and may eliminate in the future the use of optical map for contig ordering and short-reads for lower error rates.Currently, the combination of long-and short-reads is still relevant (Di Marco et al., 2023;Hepner et al., 2023).
Our results showed that using Illumina HiSeq and an optical map increased the genome size and number of CDSs, corrected misassembles, and reduced the number of pseudogenes (Table 1).Concerning synteny, the correct sequence order and content are required to study the genome plasticity events such as inversion, translocation, insertion, and deletions (Lehri, Seddon & Karlyshev, 2017).As shown in Fig. 1, optical mapping could correctly order contigs from sequencing.The rearranged regions are strictly between transposase genes, which could explain possible rearrangements (Hickman & Dyda, 2016).Some transposase sequences were found only in the third version of the Cp162 genome, within its exclusive 18 genes (Data S4).With frameshifts, the first and second genome versions were sequenced using SOLiD and Ion Torrent platforms, known for indel sequencing errors (Loman et al., 2012), leading to CDS frameshifts.The correct identification of pseudogenes is required for gene evolution analysis and for gene content studies such as pangenomics.In NCBI's PGAP annotation pipeline (Li et al., 2021), a pseudogene will not have a CDS annotation, while in the RAST-Tk pipeline (Brettin et al., 2015), each fragment of a pseudogene can be annotated as a CDS; this can lead to erroneous estimations of gene content across genomes and suggests that data generated using SOLiD and Ion Torrent should be used with caution.
In relation to virulence factors, we identified 30 in Cp162, and 35 distributed across the other genomes, with tox exclusively in strains isolated from buffalo (Table 1, Data S8).We could not associate a virulence gene to the camel isolate Cp162.In buffalo, the presence of the tox was suggested as a requirement to infect this host (Viana et al., 2017).
In relation to antimicrobial resistance genes, Cp162 and all the other genomes have the genes rpoB2 and rbpA (Data S8) that confer resistance to rifampin according to the CARD database (CARD accessions ARO:3000501 and ARO:3000245).Although rifampin mixed with other types of antimicrobials have been suggested by some authors (Heggelund et al., 2015;Sting et al., 2022), a recent study pointed that resistance to rifampin is present in some lineages infecting sheep and goats (El Damaty et al., 2023).This result suggests this antimicrobial should be avoided for infection treatment.Although C. pseudotuberculosis is susceptible to many antibiotic chemicals in vitro, the intracellular nature and encapsulation around lesions confers some protection (Baird & Fontaine, 2007).Common antibiotics used for treatment are penicillin or erythromycin combined with rifampin (Williamson, 2001).In the case of particularly valuable animals, surgical treatment of the lesions can be performed (Baird & Fontaine, 2007).
In Cp162 we predicted an incomplete prophage (Data S5 and S6) and 13 GEIs (Fig. 2, Data S7).GEI 5 is exclusive to the clade composed of Cp162 (camel), I37 (cow), and G1 (alpaca) (Fig. 2), but it may be due to a common ancestor rather than host tropism because strain 262 (equi) and I19 (ovis) also infect cows.The genome has three CRISPR arrays and a Type I-E CRISPR-Cas system (Table 1).Type I-E was previously found only in biovar equi in C. pseudotuberculosis, while proteins from Type III restriction-modification systems were exclusive from biovar ovis (Parise et al., 2018).
Cp162 is the only strain from a camel with a sequenced genome, and we looked for genes that could be involved in the tropism of this host by comparing its genome to 129 others from the same species.The exclusively present genes are transposases and truncated proteins, while the exclusively absent are lysG, an enzyme from the NUDIX superfamily and a hypothetical protein with no domains (Table 2).There is no clear relation between those genes and host tropism for camels.If there are any genome features related to tropism, they could be verified by sequencing the genomes of more strains from this host and performing a genome-wide association study (GWAS) testing for gene presence/absence or SNPs.
The phylogeny of 130 genomes (Fig. 3) supports the previous assumption that biovar ovis is a clade that originated from biovar equi, with its exclusive adaptations, and biovar equi as paraphyletic with two exclusive hosts (horse and buffalo) (Viana et al., 2018).Sampling more strains from camels could show they form exclusive clades in biovar equi and ovis that could suggest clonal expansion after host adaptation.The species pan-genome was estimated as closed (α > 1.00), which means that sequencing more genomes will not reveal new genes (Tettelin et al., 2008).

Figure 1
Figure 1 Alignment of the three genome assembly versions of Corynebacterium pseudotuberculosis strain Cp162.(A) Two inversions and one rearrangement in the first assembly.(B) Increase in genome size throughout the assemblies.Full-size DOI: 10.7717/peerj.16513/fig-1

Figure 3
Figure3Phylogenomic tree of Corynebacterium pseudotuberculosis genomes.The tree was built using the core genome identified and aligned using Panaroo and MAFFT, respectively.A phylogeny using the Maximum Likelihood method was built using IQ-TREE2 with 1,000 replicates of ultrafast bootstrap approximation and C. ulcerans NCTC7910 (not shown) as an outgroup.Bootstrap values are represented as a branch color scale that ranges from 85% (yellow) to 100% (blue).The biovar ovis clade is collapsed.Full-size DOI: 10.7717/peerj.16513/fig-3

Table 2 Exclusively present and absent genes in Corynebacterium pseudotuberculosis strain Cp162 (camel) in comparison to other 129 genomes of the same species.
Cluster of Orthologous Genes; GEI, Genomic Island; KEGG, Kyoto Encyclopedia of Genes and Genomes.